Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port by igerber · Pull Request #475 · igerber/diff-diff

igerber · 2026-05-20T18:28:56Z

Summary

Lifts the two NotImplementedError gates in _validate_vcov_args blocking vcov_type="hc2_bm" + weights (TODO.md rows 104-105, "Gates 4 and 5")
Ports clubSandwich's specific WLS-CR2 algebra into _compute_cr2_bm / _compute_cr2_bm_contrast_dof / _compute_bm_dof_from_contrasts (W not √W in hat matrix, W² in bias-correction term, unweighted residuals in score, full H1/H2/H3 array Satterthwaite DOF)
Threads weighted Bell-McCaffrey DOF through LinearRegression._bm_dof and LinearRegression.get_inference()
Adds NaN-guard for FP-noise-floor high-leverage FE-dummy contrasts (where BLAS reduction order differences between NumPy and R produce 15-30% DOF disagreement despite vcov matching at machine precision)
6 new weighted scenarios in benchmarks/data/clubsandwich_cr2_golden.json pin Python vs R parity at atol=1e-10 (vcov + non-noise-floor DOF + compound-contrast DOF); existing 6 unweighted scenarios unchanged
Public surface restrictions: weight_type ∈ {"aweight", "fweight"} + hc2_bm + weights raises NotImplementedError (port matches pweight only); zero-total-weight clusters dropped with effective-cluster ≥ 2 guard

Methodology references

Method: WLS-CR2 Bell-McCaffrey cluster-robust variance + Satterthwaite DOF
Source: clubSandwich v0.7.0 (Pustejovsky 2024) R source — R/CR-adjustments.R::CR2, R/clubSandwich.R::vcov_CR, R/coef_test.R::Satterthwaite_df, R/get_arrays.R::get_GH. Foundational papers: Bell & McCaffrey (2002), Pustejovsky & Tipton (2018) JBES, Imbens & Kolesar (2016) ReStat.
Intentional deviation from textbook: The diff-diff port matches clubSandwich's specific algebra, NOT a textbook PT2018 §3.3 transform-once derivation. Per prior empirical finding (feedback_wls_cr2_clubsandwich_parity), the textbook reading diverges from clubSandwich by 0.5-30% on weighted designs. clubSandwich uses W (not √W) in the hat matrix, W² in the bias term, and unweighted residuals in the score construction. Documented in docs/methodology/REGISTRY.md Phase 1a section.
Documented precision limit: high-leverage FE-dummy contrasts (e.g., unit dummies in a balanced DiD) have Satterthwaite DOF at the float64 noise floor; the helper returns NaN with a UserWarning rather than ship BLAS-implementation-dependent values.

Validation

Tests added: tests/test_methodology_wls_cr2.py (19 tests: clubSandwich parity at atol=1e-10 across 6 weighted scenarios + compound-contrast DOF + unweighted regression safety + dual-path equivalence + weight-type rejection + zero-weight cluster rejection + LinearRegression _bm_dof threading + LinearRegression NaN inference end-to-end + FE-dummy noise-floor guard)
Tests updated: tests/test_linalg_hc2_bm.py flipped two "gate raises NotImplementedError" tests to "gate lifted, produces finite vcov+DOF" smoke tests
Step 0 R/Python smoke parity verified at atol=1e-15 before source edits (per feedback_r_source_smoke_test_before_implementing)
Local codex review iterated R1-R8 to ✅ clean (8 rounds)
340 tests pass across linalg + methodology + estimators + TWFE + SA + estimators-vcov-type

Security / privacy

Confirm no secrets/PII in this PR: Yes (canonical secret-pattern scan clean)

Test plan

Verify R-parity tests pass: pytest tests/test_methodology_wls_cr2.py
Verify unweighted regression safety: pytest tests/test_linalg_hc2_bm.py
Verify broader regression: pytest tests/test_estimators_vcov_type.py tests/test_methodology_twfe.py tests/test_methodology_sun_abraham.py tests/test_estimators.py
CI codex review (local convergence is necessary but not sufficient per feedback_local_codex_vs_ci_codex_divergence)

🤖 Generated with Claude Code

`vcov_type="hc2_bm" + weights` (both one-way and cluster-robust) is now supported, matching `clubSandwich::vcovCR(..., type="CR2") + coef_test(test= "Satterthwaite")$df_Satt` and `Wald_test(test="HTZ")$df_denom` at atol=1e-10 on six new weighted scenarios in clubsandwich_cr2_golden.json. Immediate UX benefit: DifferenceInDifferences, MultiPeriodDiD, and TwoWayFixedEffects now accept `vcov_type="hc2_bm" + survey_design= SurveyDesign(weights=...)` for analytical weights. Closes TODO.md rows 104-105 (open weighted-CR2 gates). Algorithm note: the diff-diff form matches clubSandwich's specific algebra (W not sqrt(W) in hat matrix, W² in bias term, unweighted residuals in score), NOT a textbook Pustejovsky-Tipton (2018) §3.3 transform-once derivation - the two diverge by 0.5-30% on weighted designs per feedback_wls_cr2_clubsandwich_parity. Satterthwaite DOF uses the full H1/H2/H3 array construction (clubSandwich get_arrays.R::get_GH), not the simpler (tr B)²/tr(B²) form (which is exact unweighted but diverges from clubSandwich on weighted designs by ~6%). Step 0 R smoke test validated the algorithm at atol=1e-15 before source edits per feedback_r_source_smoke_test_before_implementing. Unweighted CR2-BM is bit-equal to prior at atol=1e-14 (regression-safe via TestUnweightedRegressionStillBitEqual + TestDOFFormulaDualPathEquivalence asserting the simple and P_array DOF formulas agree at the unweighted limit). clubSandwich version pin: >= 0.7.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…usters, P2 docstrings) P0 (weight_type contract gap): The clubSandwich WLS-CR2 port matches the `pweight` (sampling-weight) convention only. The dispatcher now rejects `vcov_type="hc2_bm" + weights + weight_type in {"aweight", "fweight"}` with NotImplementedError pointing to `weight_type="pweight"` or `vcov_type="hc1"` (CR1 supports all three weight types) as workarounds. P1 (zero-total-weight clusters): `_compute_cr2_bm` and `_compute_cr2_bm_contrast_dof` now drop zero-total-weight clusters before the G>=2 check, raising ValueError when fewer than 2 effective clusters remain. Mirrors the CR1 zero-cluster handling. Three-cluster fits where one cluster has all-zero weight silently drop it (its scores contribute zero anyway). P2 (stale docstrings): Updated `_validate_vcov_args` and `solve_ols` docstrings to reflect the lifted-gate + pweight-only scope. Removed the contradictory "Not supported with weights" claim that survived the gate lift. New tests in `tests/test_methodology_wls_cr2.py`: - TestWLSCR2WeightTypeRejection: 4 tests (aweight/fweight rejections on cluster and one-way paths; pweight smoke acceptance test). - TestWLSCR2ZeroWeightClusterRejection: 3 tests (one-zero-cluster reject on both `_compute_cr2_bm` and `_compute_cr2_bm_contrast_dof`; multi- cluster with one zero-weight silent drop). All 53 linalg+methodology tests pass; broader 336-test regression suite across estimators / TWFE / MPD / SA / vcov_type also clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-cluster + scope tightening) R2 P0: LinearRegression.fit() previously skipped populating self._bm_dof when both effective_cluster_ids AND _fit_weights were present (the path the R1 clubSandwich port lifted), so get_inference() fell back to df = n - k and produced anti-conservative p-values / CIs on the weighted-cluster hc2_bm surface. The dispatcher already guards non-pweight weighted hc2_bm at the linalg validator level, so reaching the _bm_dof branch guarantees a finite Satterthwaite DOF. Drop the weighted-cluster skip and populate _bm_dof from compute_robust_vcov(..., return_dof=True) like the other hc2_bm paths. R2 P2: new regression tests TestLinearRegressionWeightedClusterHC2BM (2 tests) verify LinearRegression._bm_dof matches compute_robust_vcov-level Satterthwaite DOF and that get_inference(index=i).df threads correctly per coefficient. Sanity check: cluster-driven DOF << n-k (catches future regressions where the fallback would otherwise re-emerge). R2 P3: stale docstrings at solve_ols (linalg.py:1260) and LinearRegression class docstring (linalg.py:2852) updated to reflect the lifted hc2_bm + pweight surface and the documented aweight/fweight restriction. R2 P3: CHANGELOG and REGISTRY entries reworded to scope the lift to the analytical surface (compute_robust_vcov / solve_ols / LinearRegression direct callers + analytical CR2 contrast DOF in MPD). Removed the incorrect claim that survey_design= callers benefit directly — survey designs route through the Taylor-series linearization (TSL) survey variance path, which takes precedence over the analytical CR2 sandwich (unchanged). All 194 linalg/methodology regression tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e comments) R3 was ✅ "Looks good" with only 2 P3 informational items; addressing both for cleanliness. P3 #1 (dead-code thread, estimators.py:1893): The R1 patch threaded `weights=survey_weights` into `_compute_cr2_bm_contrast_dof` on the `not _use_survey_vcov` branch, but that branch only fires when `survey_design=` is unset, in which case `survey_weights` is always None (survey designs always route through the TSL `_use_survey_vcov=True` path). The threading was a no-op and made the surface look like it supported weighted MPD avg_att via survey_design — which it doesn't. Removed the kwarg and updated the comment to reflect the de facto contract on the analytical branch. P3 #2 (R-script comment scope, generate_clubsandwich_golden.R): comments on `weighted_did_absorbed_fe` and `weighted_mpd_avg_att_dof` said the fixtures pin `DiD/MPD(survey_design=SurveyDesign(weights="w"))` paths. Reworded to say these are analytical-CR2 design-matrix parity fixtures on DiD/MPD-shaped designs (the public surface they actually pin is `compute_robust_vcov` / `solve_ols` / `LinearRegression` / the analytical CR2 contrast-DOF helper). All 144 linalg + methodology + estimators-vcov-type tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R4 was ✅ "Looks good" with 2 more P3 informational items; addressing for final cleanliness. P3 #1 (registry overstatement, REGISTRY.md:L2646): The Gate 4-5 lift entry still claimed coverage of "the analytical CR2 contrast DOF used by MultiPeriodDiD.fit() when survey_design= is NOT set and weights are passed via another mechanism" — but MPD has no non-survey weighted public entry point. Reworded to scope the lift to compute_robust_vcov / solve_ols / LinearRegression direct callers, with a separate note that `_compute_cr2_bm_contrast_dof(weights=)` is helper-ready but not exercised by public MPD. P3 #2 (docstring drift, _validate_vcov_args): The Raises block claimed `_validate_vcov_args` itself rejects non-pweight hc2_bm + weights, but the function has no weight_type parameter and the actual enforcement lives in `_compute_robust_vcov_numpy` (which has weight_type in scope). Narrowed the docstring to describe what `_validate_vcov_args` actually validates (conley + weights), with a pointer to where the pweight enforcement happens. All 53 linalg + methodology tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… drift) R5 was ✅ "Looks good" with 1 P3 informational item: the docstring of _compute_bm_dof_from_contrasts still described only the unweighted (tr B)^2 / tr(B^2) formula, but the function body now dispatches the weighted case to the clubSandwich singleton-cluster CR2 P_array form. Split the docstring into Unweighted and Weighted sections matching the two code paths. No code change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…asts (R6 P1) R6 codex review surfaced a P1: the weighted CR2-BM per-coefficient Satterthwaite DOF disagreed with clubSandwich by 15-30% on `weighted_did_absorbed_fe`'s treated-unit dummies (unit2/3/4), even though vcov matched at machine precision. Root cause (after instrumentation against R's get_GH and CR2 source): the contrast vectors for high-leverage FE-dummy coefficients project to near-zero on the design (e.g., unit2's dummy column has XW_g0 row 2 = exact zeros for the unit-1 cluster). The resulting per-cluster H_array slices and P_array entries land at the float64 noise floor (~1e-30 for typical matmul-product roundoff at ~1e-16 per entry). The DOF formula `(tr P)² / sum(P²)` is scale-invariant, but R and NumPy use different BLAS reduction orders, producing 1-bit-different roundoff that propagates into 30% DOF disagreement. Not a fixable algebra bug — fundamental FP precision limit for high-leverage contrasts. Mitigation: detect the noise floor (per-contrast `max(|P|)` below `1e-10 ×` the largest contrast's `max(|P|)`) and return NaN with a `UserWarning`. Honest signal that the DOF cannot be reliably computed instead of silently shipping BLAS-implementation-dependent inference. The coefficient SEs remain valid; only the affected DOF (and any t-test or CI that depends on it) is suppressed. Documented as a precision limit in REGISTRY.md and CHANGELOG.md. New regression test `TestWLSCR2FEDoFNoiseGuard` pins the NaN-guard behavior on the weighted_did_absorbed_fe scenario (unit2/3/4 expected NaN; all other 9 coefficients still match clubSandwich at atol=1e-10). All 339 linalg + estimators + methodology tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…R7 P0) R7 codex flagged a P0: the noise-floor NaN-guard in _cr2_bm_dof_inner_weighted correctly returns NaN DOF, but LinearRegression.get_inference() converted non-finite _bm_dof to df=None, which safe_inference() then treated as normal-theory inference — producing huge t-stats, p≈0, and zero-width CIs for the guarded coefficients instead of suppression. Fix: in get_inference(), when _bm_dof[index] is non-finite (NaN), return InferenceResult with NaN t_stat/p_value/conf_int and df=None directly, short-circuiting the normal-theory fallback. SE and coefficient remain valid (vcov matched at machine precision); only the affected coef's small-sample inference is suppressed. New end-to-end regression test TestLinearRegressionFENanGuardEndToEnd fits the public LinearRegression(vcov_type="hc2_bm", weights=, cluster_ids=) on weighted_did_absorbed_fe and asserts: NaN inference for the 3 treated- unit dummies (the noise-floor cases) AND finite inference for the other 9 coefficients. This catches the exact failure mode R7 surfaced. Also tightens CHANGELOG/REGISTRY wording (R7 P3): explicitly call out that "vcov + non-noise-floor DOF + compound-contrast DOF match clubSandwich"; high-leverage FE-dummy coefficients are suppressed to NaN. All 339 linalg/estimators/methodology tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T18:35:42Z

Overall Assessment

✅ Looks good

Executive Summary

The core methodology is aligned with the Methodology Registry: this is a documented clubSandwich WLS-CR2 port, not a textbook PT2018 transform-once implementation, and that deviation is explicitly documented in docs/methodology/REGISTRY.md:L2651-L2656, so it is not a defect.
The main public-surface risks called out in the PR are covered in code and tests: pweight-only gating, zero-total-weight cluster handling, weighted one-way singleton-cluster reduction, and LinearRegression BM-DOF threading.
I found no unmitigated P0/P1 issues in the changed public surfaces (compute_robust_vcov, solve_ols, LinearRegression).
P2 The new noise-floor NaN guard is batch-relative, so weighted single-contrast calls to _compute_cr2_bm_contrast_dof can still emit BLAS-dependent finite DOF even though the Registry/CHANGELOG describe the helper as guarded. diff_diff/linalg.py:L1852-L1882, diff_diff/linalg.py:L1887-L2005, docs/methodology/REGISTRY.md:L2651-L2656, CHANGELOG.md:L13, tests/test_methodology_wls_cr2.py:L169-L215, tests/test_methodology_wls_cr2.py:L476-L533.
I could not run pytest here because the sandbox is missing the Python numerical stack.

Methodology

P2 Impact: _cr2_bm_dof_inner_weighted only suppresses a contrast when its max|P| is tiny relative to a larger contrast in the same batch. For a single weighted contrast, max|P|_overall is the contrast’s own scale, so the 1e-10 × max|P|_overall rule can never classify it as degenerate. That leaves direct single-contrast weighted calls to _compute_cr2_bm_contrast_dof able to return BLAS-dependent finite DOF on the exact noise-floor case the docs say should be suppressed. Current public estimator surfaces do not appear to rely on that path, so this is non-blocking. Concrete fix: add a single-contrast instability criterion (absolute/self-scaled, not batch-relative), or narrow the docs to say the NaN guard applies only to batched per-coefficient calls; add a regression test with a weighted FE-dummy single contrast. diff_diff/linalg.py:L1852-L1882, diff_diff/linalg.py:L1887-L2005, docs/methodology/REGISTRY.md:L2651-L2656
P3 Impact: the documented deviation from textbook PT2018 to clubSandwich-specific WLS algebra is clearly disclosed and matches the implementation, so I am treating it as an accepted implementation choice, not a defect. Concrete fix: none. diff_diff/linalg.py:L1495-L1654, docs/methodology/REGISTRY.md:L2651-L2656, CHANGELOG.md:L13

Code Quality

No findings.

Performance

P3 Impact: LinearRegression.fit() now pays for CR2 twice on the new weighted hc2_bm path: once inside solve_ols(..., return_vcov=True) and again via compute_robust_vcov(..., return_dof=True) to fill _bm_dof. Correctness is fine, but it adds a second full CR2 pass on the heaviest new path. Concrete fix: let solve_ols optionally return DOF from the same CR2 computation, or cache/share the CR2 precomputes. diff_diff/linalg.py:L3235-L3276

Maintainability

No separate findings beyond the duplicated CR2 work above.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

No documentation defect: the Registry, TODO lift, and changelog all match the implemented pweight-only contract and the clubSandwich-specific algebra. docs/methodology/REGISTRY.md:L2651-L2656, TODO.md:L104-L110, CHANGELOG.md:L13
Test coverage is strong for parity, weight-type rejection, zero-weight clusters, unweighted regression safety, and LinearRegression NaN propagation; the remaining gap is the single-contrast noise-floor case noted above. tests/test_methodology_wls_cr2.py:L73-L215, tests/test_methodology_wls_cr2.py:L218-L600
I did not execute the suite here because the sandbox lacks NumPy/pytest.

Assumption

I treated _compute_cr2_bm_contrast_dof as an internal helper. If it is considered supported API rather than internal implementation detail, I would raise the single-contrast noise-floor issue one severity level.

CI codex on PR #475 (✅ verdict) flagged a real P2: the noise-floor NaN- guard in `_cr2_bm_dof_inner_weighted` was batch-relative only — for a single-contrast call to `_compute_cr2_bm_contrast_dof`, `max|P|_overall` equals the contrast's own max|P|, so the `1e-10 × max|P|_overall` rule could never classify it as degenerate. That left direct single-contrast weighted callers (e.g., MPD avg_att) unprotected: they could still emit BLAS-implementation-dependent finite DOF on noise-floor contrasts even though the registry/changelog said the helper was guarded. Fix: union the batch-relative criterion with an absolute floor scaled to the bread matrix's magnitude: `(EPS × n × k × max(bread_inv_scale, 1))²`. This covers the worst-case dgemm accumulation roundoff floor for `H1/H2/H3 @ contrast` products. A single-contrast call now correctly fires the NaN-guard on a high-leverage FE-dummy contrast. New regression tests in `tests/test_methodology_wls_cr2.py:: TestWLSCR2SingleContrastNoiseFloor` (2 tests): single weighted FE-dummy contrast triggers NaN-guard + warning; single non-noise contrast still returns finite DOF matching clubSandwich at atol=1e-10. CI codex P3 (perf): LinearRegression.fit() pays CR2 twice on the new weighted hc2_bm path (solve_ols + compute_robust_vcov). Added as a TODO follow-up row (PR #475 follow-up, Low priority). All 198 linalg+methodology+estimators-vcov-type+TWFE+SA tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T19:08:55Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8f00e976d33dcd3600239bebcdfaf8ce5d72960b

Overall Assessment

⛔ Blocker

Executive Summary

The prior non-blocking single-contrast noise-floor concern appears resolved: _cr2_bm_dof_inner_weighted() now has an absolute single-contrast floor, and the new regression tests cover both the guarded and non-guarded single-contrast cases. diff_diff/linalg.py:L1852-L1903, tests/test_methodology_wls_cr2.py:L544-L623
P0 The new weighted clustered CR2 path is still wrong on supported pweight inputs when a positive-weight cluster contains some zero-weight padding rows. Those zero-weight rows still enter the CR2 adjustment blocks and can silently change SE/DOF instead of being inert. diff_diff/linalg.py:L1548-L1654, diff_diff/linalg.py:L1951-L2024, diff_diff/linalg.py:L2179-L2188
The documented methodology deviation from a textbook PT2018 transform-once reading is properly recorded in the Methodology Registry, so I do not treat that clubSandwich-specific algebra choice as a defect. docs/methodology/REGISTRY.md:L2651-L2656
Weight-type rejection (aweight/fweight) and LinearRegression._bm_dof threading look correctly implemented and covered on the changed public paths. diff_diff/linalg.py:L2263-L2317, diff_diff/linalg.py:L3272-L3315, tests/test_methodology_wls_cr2.py:L214-L298, tests/test_methodology_wls_cr2.py:L358-L457
I could not run pytest here because the sandbox does not have NumPy/pytest.

Methodology

Severity: P0. Impact: weighted clustered CR2 is not subpopulation-invariant on supported pweight inputs with mixed zero/nonzero weights inside a cluster. The new code drops only zero-total-weight clusters, but it still builds S_W, H_gg, G_g, and A_g on X_g for all rows in a positive-weight cluster. Because zero-weight rows remain in the row side of H_gg and in bias_term = X_g @ MUWTWUM @ X_g.T, excluded observations can still alter the adjustment matrix and therefore the reported vcov/DOF. That is silent wrong statistical output on an allowed input, and it contradicts the surrounding linalg contract that zero-weight rows “contribute nothing to the sandwich.” The new tests only cover all-zero clusters, not this mixed-zero case. Concrete fix: before weighted clustered CR2/contrast-DOF construction, filter to weights > 0 (globally or per cluster) so mixed-zero rows are fully removed from X, residuals, cluster_ids, and weights; then add an invariance test against the physically dropped positive-weight subset. Refs: diff_diff/linalg.py:L1548-L1654, diff_diff/linalg.py:L1951-L2024, diff_diff/linalg.py:L2179-L2188, tests/test_methodology_wls_cr2.py:L301-L355, docs/methodology/REGISTRY.md:L2651-L2656
Severity: P3 informational. Impact: the clubSandwich-specific WLS algebra (W, not sqrt(W); W^2 bias term; unweighted residual score) is documented in the Registry, so that deviation from a textbook reading is acceptable and not a defect. Concrete fix: none. Refs: docs/methodology/REGISTRY.md:L2651-L2656, diff_diff/linalg.py:L1504-L1527

Code Quality

No findings.

Performance

Severity: P3 informational, tracked in TODO.md. Impact: LinearRegression.fit() still pays for CR2 twice on the weighted hc2_bm path: once via solve_ols(..., return_vcov=True) and again via compute_robust_vcov(..., return_dof=True) to populate _bm_dof. Concrete fix: thread return_dof through solve_ols or cache/share the CR2 precomputes. Refs: diff_diff/linalg.py:L3251-L3295, TODO.md:L111-L111

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: TODO.md now says rows 104-105 are lifted, but Tier C still separately lists “Weighted one-way Bell-McCaffrey + weighted CR2 Bell-McCaffrey” as outstanding, so the deferred-work inventory is self-contradictory. Concrete fix: remove or rewrite the stale Tier C bullet. Refs: TODO.md:L104-L111, TODO.md:L210-L210

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the Registry and changelog still describe the noise-floor guard only as the batch-relative 1e-10 × max|P| rule, but the shipped code now also applies an absolute single-contrast floor to fix the previous review item. The docs therefore underdescribe when NaN suppression can fire on direct single-contrast calls. Concrete fix: update REGISTRY.md and CHANGELOG.md to mention the absolute single-contrast floor (or describe the full max(relative_floor, abs_noise_floor) rule). Refs: diff_diff/linalg.py:L1862-L1885, docs/methodology/REGISTRY.md:L2651-L2651, CHANGELOG.md:L13-L13, tests/test_methodology_wls_cr2.py:L544-L623
I did not execute the suite here because NumPy/pytest are unavailable in the sandbox.

Path to Approval

Make weighted clustered CR2 ignore zero-weight rows inside otherwise positive-weight clusters, not just zero-total-weight clusters.
Add regression tests that compare clustered vcov_type="hc2_bm" + weight_type="pweight" on mixed-zero clusters to the physically dropped positive-weight subset, covering both per-coefficient DOF and contrast DOF on at least one public entry point.

… P0 R2) CI codex round 2 on PR #475 flagged a P0: weighted clustered CR2 wasn't subpopulation-invariant on mixed-zero clusters. The earlier "drop zero- total-weight clusters" guard handled all-zero clusters but missed mixed- zero clusters (positive total weight, some zero-weight rows inside). In those clusters, zero-weight rows still entered the CR2 adjustment matrices (H_gg, G_g, A_g, bias_term) on the row side, silently changing SE/DOF — contradicting the linalg contract that zero-weight rows are inert. Fix: physically filter `weights > 0` rows before all per-cluster computations in both `_compute_cr2_bm` and `_compute_cr2_bm_contrast_dof`. The caller's `bread_matrix = X.T @ (X * w[:, None])` is invariant to zero-weight row removal (those rows contribute 0 to the sum), so no bread rebuild is needed. Effective-cluster filter still applies on the filtered view. New regression tests `TestWLSCR2SubpopulationInvariance` (2 tests): - `test_per_coefficient_dof_invariant_to_zero_weight_padding`: pin vcov + per-coefficient DOF at atol=1e-12 between (a) computing on the full design with zero-weight padding rows interleaved and (b) computing on the physically dropped positive-weight subset. - `test_contrast_dof_invariant_to_zero_weight_padding`: same invariance for compound-contrast DOF. Also addresses two P3 doc items from the same review: - REGISTRY + CHANGELOG noise-floor description: now explicitly mentions both criteria (batch-relative AND absolute single-contrast safe) instead of only the batch-relative one. - TODO.md Tier C self-contradiction: marked the WLS-CR2 line as LIFTED with a back-reference to closed rows 109-110, removing the contradiction between the rows table (lifted) and the prose Tier C list (still outstanding). P3 performance (LinearRegression pays CR2 twice on weighted hc2_bm path) remains tracked in TODO.md row 111 from the previous round. All 200 linalg+methodology+TWFE+SA+estimators-vcov-type tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T19:21:04Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a2fa8016660437d5972b394a32ca4965f4bd6c88

Overall Assessment

✅ Looks good

Executive Summary

The prior blocker from the last review appears resolved: both weighted CR2 builders now physically drop weights == 0 rows before any cluster-block construction, and the new invariance tests cover the mixed-zero case.
The clubSandwich-specific WLS-CR2 algebra and the FE-dummy noise-floor NaN guard are documented in the Methodology Registry, so I do not treat those deviations from a textbook PT2018 reading as defects.
LinearRegression now threads weighted Bell-McCaffrey DOF into get_inference() and returns all-NaN inference when the noise-floor guard fires, which fixes the earlier fallback-to-n-k / normal-theory problem.
Remaining items are non-blocking: the Registry row should explicitly say the lift is pweight-only, and weighted hc2_bm still recomputes CR2 twice in LinearRegression.fit() (already tracked in TODO.md).
I could not run the test suite here because the sandbox lacks NumPy/pytest.

Methodology

No P0/P1 findings. The registry now documents the load-bearing methodology choices and precision-limit behavior in docs/methodology/REGISTRY.md:L2651-L2656, and the prior mixed-zero-weight blocker appears resolved by the new weights > 0 filtering in both weighted CR2 builders diff_diff/linalg.py:L1548-L1583 and diff_diff/linalg.py:L1969-L1993, with regression coverage in tests/test_methodology_wls_cr2.py:L536-L724.

Severity: P3. Impact: The new Phase 1a registry row says weighted hc2_bm + weights is supported on the analytical linalg surface, but unlike the in-code docs it does not record the load-bearing pweight-only restriction. That overstates the supported surface in the project’s methodology registry even though the code correctly rejects aweight and fweight. Refs: docs/methodology/REGISTRY.md:L2651-L2656, diff_diff/linalg.py:L1259-L1267. Concrete fix: Add one sentence to the Phase 1a registry row stating that the clubSandwich WLS-CR2 port matches weight_type="pweight" only and that aweight / fweight remain intentionally unsupported.

Code Quality

No findings.

Performance

Severity: P3. Impact: LinearRegression.fit() still pays for the weighted hc2_bm CR2 path twice: once via solve_ols(..., return_vcov=True) and again via compute_robust_vcov(..., return_dof=True) to populate _bm_dof. This is correct but redundant on the newly opened path. Refs: diff_diff/linalg.py:L3306-L3336, TODO.md:L111-L111. Concrete fix: Thread return_dof through solve_ols() or cache/share the weighted-CR2 precomputes (A_g, MUWTWUM) between the vcov and DOF calls.

Maintainability

No findings.

Tech Debt

No additional findings beyond the tracked weighted-CR2 double-compute follow-up in TODO.md:L111-L111.

Security

No findings.

Documentation/Tests

No blocking findings. The new methodology harness covers parity, weight-type rejection, effective-cluster rejection, mixed-zero-row subpopulation invariance, single-contrast noise-floor guarding, and LinearRegression end-to-end NaN inference propagation in tests/test_methodology_wls_cr2.py:L218-L808.

I could not execute the suite here because the sandbox lacks NumPy/pytest.

CI codex R3 on PR #475 (✅ verdict) flagged a single actionable P3 docs item: the Phase 1a registry row said "Weighted Bell-McCaffrey is supported" but didn't record the load-bearing `pweight`-only restriction. The code correctly rejects `aweight`/`fweight` at the linalg validator, but the registry overstated the supported surface. Fix: added a sentence to the Phase 1a row stating the clubSandwich WLS-CR2 port matches the `pweight` convention only; `aweight`/`fweight` remain intentionally unsupported (CR1 / `vcov_type="hc1"` still supports all three). CI codex R3 P3 perf (LinearRegression pays CR2 twice on weighted hc2_bm path) is already tracked in `TODO.md` row 111 from R2; no new action. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T19:56:36Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4fe217e8d2cbb1cf9e4b3347ff0025fd602a71a6

Overall Assessment

✅ Looks good

Executive Summary

The prior re-review’s remaining documentation nit is resolved: the Phase 1a registry row now explicitly scopes weighted hc2_bm to the analytical linalg surface and weight_type="pweight" only, matching the new code paths.
I did not find any unmitigated P0/P1 issues in the weighted CR2 lift. The two earlier correctness risks this PR family had been carrying are addressed in the changed code: zero-weight padded rows are dropped before CR2 block algebra, and LinearRegression.get_inference() now suppresses all inference fields when BM DOF is NaN instead of silently falling back to normal-theory inference.
The methodology change is consistent with the cited source material and the Methodology Registry; I do not treat the documented clubSandwich-specific algebra or the documented noise-floor NaN behavior as defects.
Test coverage is strong for the new surface: R-golden parity, weight-type rejection, effective-cluster rejection, zero-weight subpopulation invariance, noise-floor guarding, and LinearRegression end-to-end threading.
The only remaining issue I saw is the already-tracked double-compute on weighted hc2_bm inside LinearRegression.fit(), which is P3/non-blocking.

Methodology

No P0/P1 findings. The weighted CR2 lift matches the clubSandwich source path the PR says it ports: weighted lm objects default to the non-inverse_var branch, CR2 builds the weighted H_jj / MUWTWUM adjustment, and get_GH + get_P_array + Satterthwaite_df define the H-array/P-array Satterthwaite DOF used by coef_test. The local implementation in diff_diff/linalg.py:L1495-L2052 and the registry note in docs/methodology/REGISTRY.md:L2651-L2656 are consistent with that contract. (rdrr.io)

Code Quality

No findings.

Performance

P3 Impact: LinearRegression.fit() still computes weighted CR2 twice on the opened hc2_bm path, once through solve_ols(..., return_vcov=True) and again through compute_robust_vcov(..., return_dof=True). This is already tracked in TODO.md:L111-L111, so it is mitigated and non-blocking. Concrete fix: thread return_dof through solve_ols() or cache the CR2 precomputes reused by both calls. Refs: diff_diff/linalg.py:L3300-L3339, TODO.md:L111-L111.

Maintainability

No findings.

Tech Debt

No new untracked findings.

Security

No findings.

Documentation/Tests

No findings. The prior registry scope gap is fixed in docs/methodology/REGISTRY.md:L2651-L2656, and the new coverage in tests/test_methodology_wls_cr2.py:L73-L884 plus tests/test_linalg_hc2_bm.py:L302-L429 hits the right edge cases for this lift. I could not execute the tests here because this environment lacks numpy and pytest.

Adds vcov_type ∈ {"classical","hc1","hc2","hc2_bm"} to StackedDiD with clubSandwich R parity at atol=1e-10 on the hc2_bm path. Defaults to hc1 (prior behavior at machine precision). Source edits: - __init__ accepts vcov_type=, rejects classical/hc2 with cluster- incompatibility ValueError (StackedDiD clusters intrinsically at unit or unit_subexp; one-way families don't compose with cluster_ids per the linalg validator), rejects conley with a deferral message. - fit() rejects survey_design + vcov_type != "hc1" with NotImplementedError (survey TSL/replicate-refit overrides analytical sandwich; SA precedent). Reject order locked: fweight/aweight check fires first, then vcov check (pinned by test_aweight_plus_hc2_bm_*). - Main solve_ols call at :419: switched from bake-Q-into-X (X_t = X*sqrt(Q)) to explicit `solve_ols(X, Y, weights=composed_weights, vcov_type=self.vcov_type)`. solve_ols internally bakes Q for the coef solve AND back-transforms for vcov on original-scale data via clubSandwich's WLS-CR2 algebra for hc2_bm (PR igerber#475). - _refit_stacked closure at :444: mirror switch (cosmetic per return_vcov=False but grep-consistency). - StackedDiDResults gains vcov_type field; get_params() includes it. - llms-full.txt agent-facing entry updated. R parity: - benchmarks/data/stacked_did_test_panel.csv (NEW) — pre-stacked panel generated by extracting StackedDiD's internal stacked_data via the fixed-seed Python fixture (n_units=50, n_periods=8, cohorts=[3,5,7], seed=20260521; 325 rows after stacking). - benchmarks/R/generate_stacked_did_golden.R (NEW) — loads CSV, fits lm(weights=Q,...) with the same event-study design as Python, computes CR1S (Stata-style) + CR2-BM SE + BM df_Satt at cluster=unit AND cluster=unit_subexp via clubSandwich >= 0.7.0. - benchmarks/data/stacked_did_golden.json (NEW) — committed goldens. Tests: - tests/test_stacked_did.py::TestStackedDiDVcovType (19 tests): default + bit-equality + reject paths + survey precedence + get_params/set_params + result-class field + clone idempotency + replicate-refit smoke + reject-order regression + unit_subexp + survey symmetric. - tests/test_methodology_stacked_did.py (4 tests, NEW): hc1 vs CR1S, hc2_bm vs CR2 (unit), BM DOF vs coef_test()$df_Satt (via t-dist inversion of CI half-width), hc2_bm vs CR2 (unit_subexp). - Tolerance note: hc1 vs prior bake-Q-into-X is bit-equal up to ~2 ULPs at SE scale (multiplication ordering in solve_ols(weights=) vs prior user-side bake-w); pinned by test_hc1_se_bit_equal_to_pre_pr_baseline at atol=1e-13. Docs: - REGISTRY.md StackedDiD section: new "Variance families" subsection documenting hc1/hc2_bm routing + rejects, with **Note:** deferring methodology framing to Phase 1a clubSandwich port (PR igerber#475) — no new methodology synthesis introduced in this PR. - CHANGELOG.md: Unreleased Added bullet at top. - TODO.md: Phase 1b row updated to track remaining 6 estimators; new StackedDiD conley follow-up row mirroring SA precedent. 234 tests pass across stacked + methodology_stacked_did + methodology_sun_abraham + methodology_twfe + methodology_wls_cr2 + estimators_vcov_type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Local codex R3 caught a P1 on the same surface: when BM contrast DOF was unavailable (helper raised OR noise-floor NaN guard fired), StackedDiD either fell back to normal-theory inference (silent wrong CIs/p-values under the hc2_bm contract) or passed NaN df through safe_inference (which guards df<=0 but NOT NaN, producing finite t_stat with NaN p/CI — mixed inconsistent inference fields). Both fail-open patterns contradict the registry contract that hc2_bm aggregated inference uses BM DOF. Fix mirrors the LinearRegression.get_inference() pattern from PR igerber#475 R7 (linalg.py:3689-3706): on the hc2_bm path, when the per-contrast BM DOF is None or non-finite, emit ALL-NaN inference fields for that contrast rather than falling back to safe_inference(df=None or NaN). Effect and SE remain finite — only the inference fields downstream of the DOF are suppressed. Applied at both inference sites: - Event-study per-event-time inference (event_study_effects[h]): inline check before safe_inference call. - Overall ATT inference (overall_t/p/conf_int): same guard. New regression tests: - test_hc2_bm_nan_dof_fails_closed_with_all_nan_inference: monkeypatches `_compute_cr2_bm_contrast_dof` to return a NaN-only vector and asserts ALL event-study + overall t_stat/p_value/conf_int are NaN while effect and se remain finite. - test_hc2_bm_helper_raises_fails_closed_with_all_nan_inference: monkeypatches the helper to raise LinAlgError and asserts the fallback path NaN-closes inference (does NOT use normal-theory). 77 tests pass. Pattern is consistent with the analytical-surface fail-closed contract added in PR igerber#475 R7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Release notes consolidate 8 PRs since 3.4.0 (2026-05-19): Public-surface variance lifts: - SpilloverDiD survey_design on HC1/CR1 via Binder TSL (Wave E.1, igerber#468) - SpilloverDiD vcov_type=conley + survey_design via stratified-Conley on PSU totals (Wave E.2, igerber#474) + lag_cutoff>0 follow-up (igerber#477) - SunAbraham vcov_type ∈ {classical, hc1, hc2, hc2_bm} (Phase 1b 1/8, igerber#472) - WLS-CR2 Bell-McCaffrey gates lifted via clubSandwich port (igerber#475) Methodology-review-tracker promotions (mostly docs/tests): - PreTrendsPower R pretrends parity goldens (PR-C, igerber#471) - HAD methodology-review-tracker promotion (igerber#473) - ContinuousDiD methodology-review-tracker promotion (igerber#476) All changes additive; bit-equal defaults preserved across the affected estimators. No new estimators (patch-level per semver convention). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber and others added 8 commits May 20, 2026 14:23

igerber added the ready-for-ci Triggers CI test workflows label May 20, 2026

igerber merged commit 955aa4b into main May 21, 2026
33 of 34 checks passed

igerber deleted the wls-cr2-clubsandwich-port branch May 21, 2026 00:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port#475

Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port#475
igerber merged 11 commits into
mainfrom
wls-cr2-clubsandwich-port

igerber commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 20, 2026

Summary

Methodology references

Validation

Security / privacy

Test plan

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant